Bank Marketing

1. Frame

A Portuguese Bank wants to run a direct marketing campaign to sell its new term deposit plan. The goal is to help them identify customers who would most likely buy the plan?

Open Discussion How to approach this problem?

2. Acquire

UCI has a number of datasets related to machine learning. We will leverage the Bank Marketing dataset. Look into this link for more information https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Load the train and test datasets


In [1]:
train <- 
test <-

Attribute Information:

Input variables:

bank client data:

  1. age (numeric)
  2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
  4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
  5. default: has credit in default? (categorical: 'no','yes','unknown')
  6. housing: has housing loan? (categorical: 'no','yes','unknown')
  7. loan: has personal loan? (categorical: 'no','yes','unknown')
  1. contact: contact communication type (categorical: 'cellular','telephone')
  2. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  3. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  4. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  3. previous: number of contacts performed before this campaign and for this client (numeric)
  4. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes

  1. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  2. cons.price.idx: consumer price index - monthly indicator (numeric)
  3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  4. euribor3m: euribor 3 month rate - daily indicator (numeric)
  5. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

y - has the client subscribed a term deposit? (binary: 'yes','no')

3. Explore

Exercise


In [2]:
#Find frequency of deposit

In [6]:
# Train:
print("Train")
#Code here
print("Test")
#Code here


[1] "Train"

   no   yes 
31092  4119 
[1] "Test"

  no  yes 
8830 1170 

In [7]:
# Find number of rows and columns in train and test
print("Train")
#Code here
print("Test")
#Code here


[1] "Train"
Out[7]:
  1. 35211
  2. 17
[1] "Test"
Out[7]:
  1. 10000
  2. 17

In [9]:
# Find names of the features in train and test
print("Train")
#Code here
print("Test")
#Code here


[1] "Train"
Out[9]:
  1. 'age'
  2. 'job'
  3. 'marital'
  4. 'education'
  5. 'default'
  6. 'balance'
  7. 'housing'
  8. 'loan'
  9. 'contact'
  10. 'day'
  11. 'month'
  12. 'duration'
  13. 'campaign'
  14. 'pdays'
  15. 'previous'
  16. 'poutcome'
  17. 'deposit'
[1] "Test"
Out[9]:
  1. 'age'
  2. 'job'
  3. 'marital'
  4. 'education'
  5. 'default'
  6. 'balance'
  7. 'housing'
  8. 'loan'
  9. 'contact'
  10. 'day'
  11. 'month'
  12. 'duration'
  13. 'campaign'
  14. 'pdays'
  15. 'previous'
  16. 'poutcome'
  17. 'deposit'

Exercise

  1. Plot Deposit
  2. Plot Deposit Vs Age
  3. Plot Deposit Vs another feature
  4. Plot any two features against each other

In [ ]:


In [ ]:


In [ ]:


In [ ]:

4. Refine


In [13]:
#Print the head of train
head(train)


Out[13]:
agejobmaritaleducationdefaultbalancehousingloancontactdaymonthdurationcampaignpdayspreviouspoutcomedeposit
158managementmarriedtertiaryno2143yesnounknown5may2611-10unknownno
244techniciansinglesecondaryno29yesnounknown5may1511-10unknownno
333entrepreneurmarriedsecondaryno2yesyesunknown5may761-10unknownno
447blue-collarmarriedunknownno1506yesnounknown5may921-10unknownno
533unknownsingleunknownno1nonounknown5may1981-10unknownno
628managementsingletertiaryno447yesyesunknown5may2171-10unknownno

Exercise


In [7]:
#Print the head of test dataset


Out[7]:
agejobmaritaleducationdefaultbalancehousingloancontactdaymonthdurationcampaignpdayspreviouspoutcomedeposit
138self-employedsinglesecondaryno677yesnocellular14may1142-10unknownno
258blue-collarmarriedprimaryno5445yesnocellular14apr3911-10unknownno
355retiredmarriedsecondaryno5nonounknown20jun1081-10unknownno
426managementsinglesecondaryno63nonocellular28jul764-10unknownno
548techniciandivorcedtertiaryno907noyescellular4aug1031-10unknownno
633technicianmarriedtertiaryno525yesyesunknown28may1391-10unknownno

Find the class of each of the columns in train dataset


In [3]:
lapply(train, class)


Out[3]:
$age
'integer'
$job
'factor'
$marital
'factor'
$education
'factor'
$default
'factor'
$balance
'integer'
$housing
'factor'
$loan
'factor'
$contact
'factor'
$day
'integer'
$month
'factor'
$duration
'integer'
$campaign
'integer'
$pdays
'integer'
$previous
'integer'
$poutcome
'factor'
$deposit
'factor'

Exercise


In [11]:
#Find the classes for the features in test dataset. Does it match the train dataset?

In [ ]:

Exercise


In [14]:
#Find if train and test have missing values

In [15]:
#Does train has missing values?

In [ ]:


In [16]:
#Does test has missing values?

In [ ]:


In [19]:
#Find column names of train and test


Out[19]:
  1. 'age'
  2. 'job'
  3. 'marital'
  4. 'education'
  5. 'default'
  6. 'balance'
  7. 'housing'
  8. 'loan'
  9. 'contact'
  10. 'day'
  11. 'month'
  12. 'duration'
  13. 'campaign'
  14. 'pdays'
  15. 'previous'
  16. 'poutcome'
  17. 'deposit'

In [14]:
# Find summary statistics of train


Out[14]:
      age                 job           marital          education    
 Min.   :18.00   blue-collar:7560   divorced: 4086   primary  : 5315  
 1st Qu.:33.00   management :7382   married :21161   secondary:18053  
 Median :39.00   technician :5926   single  : 9964   tertiary :10406  
 Mean   :40.97   admin.     :4059                    unknown  : 1437  
 3rd Qu.:48.00   services   :3186                                     
 Max.   :95.00   retired    :1788                                     
                 (Other)    :5310                                     
 default        balance       housing      loan            contact     
 no :34571   Min.   : -8019   no :15663   no :29573   cellular :22817  
 yes:  640   1st Qu.:    71   yes:19548   yes: 5638   telephone: 2274  
             Median :   447                           unknown  :10120  
             Mean   :  1356                                            
             3rd Qu.:  1418                                            
             Max.   :102127                                            
                                                                       
      day           month          duration         campaign     
 Min.   : 1.0   may    :10751   Min.   :   0.0   Min.   : 1.000  
 1st Qu.: 8.0   jul    : 5398   1st Qu.: 103.0   1st Qu.: 1.000  
 Median :16.0   aug    : 4902   Median : 180.0   Median : 2.000  
 Mean   :15.8   jun    : 4123   Mean   : 258.2   Mean   : 2.759  
 3rd Qu.:21.0   nov    : 3126   3rd Qu.: 319.0   3rd Qu.: 3.000  
 Max.   :31.0   apr    : 2249   Max.   :4918.0   Max.   :63.000  
                (Other): 4662                                    
     pdays          previous           poutcome     deposit    
 Min.   : -1.0   Min.   :  0.0000   failure: 3803   no :31092  
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1425   yes: 4119  
 Median : -1.0   Median :  0.0000   success: 1184              
 Mean   : 40.1   Mean   :  0.5827   unknown:28799              
 3rd Qu.: -1.0   3rd Qu.:  0.0000                              
 Max.   :871.0   Max.   :275.0000                              
                                                               

In [15]:
#Find summary statistics of test


Out[15]:
      age                 job           marital         education    default   
 Min.   :18.00   blue-collar:2172   divorced:1121   primary  :1536   no :9825  
 1st Qu.:33.00   management :2076   married :6053   secondary:5149   yes: 175  
 Median :39.00   technician :1671   single  :2826   tertiary :2895             
 Mean   :40.83   admin.     :1112                   unknown  : 420             
 3rd Qu.:48.00   services   : 968                                              
 Max.   :89.00   retired    : 476                                              
                 (Other)    :1525                                              
    balance      housing     loan           contact          day       
 Min.   :-2827   no :4418   no :8394   cellular :6468   Min.   : 1.00  
 1st Qu.:   75   yes:5582   yes:1606   telephone: 632   1st Qu.: 8.00  
 Median :  450                         unknown  :2900   Median :16.00  
 Mean   : 1385                                          Mean   :15.82  
 3rd Qu.: 1448                                          3rd Qu.:21.00  
 Max.   :66721                                          Max.   :31.00  
                                                                       
     month         duration         campaign         pdays       
 may    :3015   Min.   :   0.0   Min.   : 1.00   Min.   : -1.00  
 jul    :1497   1st Qu.: 102.0   1st Qu.: 1.00   1st Qu.: -1.00  
 aug    :1345   Median : 180.0   Median : 2.00   Median : -1.00  
 jun    :1218   Mean   : 258.1   Mean   : 2.78   Mean   : 40.53  
 nov    : 844   3rd Qu.: 317.0   3rd Qu.: 3.00   3rd Qu.: -1.00  
 apr    : 683   Max.   :3785.0   Max.   :58.00   Max.   :805.00  
 (Other):1398                                                    
    previous          poutcome    deposit   
 Min.   : 0.0000   failure:1098   no :8830  
 1st Qu.: 0.0000   other  : 415   yes:1170  
 Median : 0.0000   success: 327             
 Mean   : 0.5721   unknown:8160             
 3rd Qu.: 0.0000                            
 Max.   :35.0000                            
                                            

In [17]:
#Find correlation between age and deposit in train dataset


Out[17]:
0.0351975918370962

In [20]:
#Find correlation between campaign and deposit in train dataset


Out[20]:
-0.0747459602733383

In [21]:
#What is the standard deviation of age in train?


Out[21]:
10.6511973326696

In [22]:
#What is the mean of age in train?


Out[22]:
40.9651529351623

In [ ]:

5. Model

Logistic Regression


In [30]:
# Logit Function

In [44]:
x <- seq(0,1, length=100)
x <- x[2:(length(x)-1)]

In [45]:
logit <- function (t) {
log( t / (1-t) )
}

In [46]:
plot(x~logit(x), type="l")



In [47]:
inv_logit <- function(x){
    exp(x)/(1+exp(x))
}

In [51]:
y <- seq(-100,100, length=200)

In [50]:
plot(y~inv_logit(y), type="l")


Exercise

Plot the inverse logit for values between -3 and +3

Logistic Regression Equation

Compare this with linear regression


In [ ]:


In [52]:
#Running the model on train

In [54]:
model <- glm(deposit~., family=binomial(link="logit"), data=train)

In [55]:
summary(model)


Out[55]:
Call:
glm(formula = deposit ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.7236  -0.3718  -0.2509  -0.1490   3.2350  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -2.631e+00  2.087e-01 -12.606  < 2e-16 ***
age                 1.394e-03  2.493e-03   0.559 0.576186    
jobblue-collar     -2.997e-01  8.245e-02  -3.635 0.000278 ***
jobentrepreneur    -3.095e-01  1.407e-01  -2.200 0.027827 *  
jobhousemaid       -4.539e-01  1.573e-01  -2.886 0.003898 ** 
jobmanagement      -1.602e-01  8.359e-02  -1.917 0.055285 .  
jobretired          2.954e-01  1.094e-01   2.700 0.006931 ** 
jobself-employed   -3.636e-01  1.308e-01  -2.781 0.005420 ** 
jobservices        -2.161e-01  9.625e-02  -2.246 0.024730 *  
jobstudent          4.002e-01  1.244e-01   3.216 0.001298 ** 
jobtechnician      -1.877e-01  7.863e-02  -2.387 0.016978 *  
jobunemployed      -1.492e-01  1.260e-01  -1.184 0.236536    
jobunknown         -2.792e-01  2.611e-01  -1.069 0.284963    
maritalmarried     -1.658e-01  6.683e-02  -2.480 0.013125 *  
maritalsingle       1.267e-01  7.625e-02   1.661 0.096696 .  
educationsecondary  1.903e-01  7.326e-02   2.598 0.009385 ** 
educationtertiary   3.622e-01  8.549e-02   4.236 2.27e-05 ***
educationunknown    2.786e-01  1.180e-01   2.361 0.018246 *  
defaultyes         -2.536e-02  1.889e-01  -0.134 0.893185    
balance             1.144e-05  5.908e-06   1.936 0.052854 .  
housingyes         -6.768e-01  5.005e-02 -13.522  < 2e-16 ***
loanyes            -4.496e-01  6.814e-02  -6.598 4.18e-11 ***
contacttelephone   -1.209e-01  8.449e-02  -1.431 0.152432    
contactunknown     -1.608e+00  8.249e-02 -19.491  < 2e-16 ***
day                 1.088e-02  2.815e-03   3.866 0.000111 ***
monthaug           -7.135e-01  8.897e-02  -8.019 1.07e-15 ***
monthdec            8.035e-01  1.967e-01   4.086 4.40e-05 ***
monthfeb           -1.560e-01  1.018e-01  -1.533 0.125232    
monthjan           -1.473e+00  1.445e-01 -10.193  < 2e-16 ***
monthjul           -8.461e-01  8.765e-02  -9.654  < 2e-16 ***
monthjun            4.940e-01  1.058e-01   4.671 3.00e-06 ***
monthmar            1.549e+00  1.362e-01  11.374  < 2e-16 ***
monthmay           -4.446e-01  8.192e-02  -5.427 5.72e-08 ***
monthnov           -9.546e-01  9.687e-02  -9.854  < 2e-16 ***
monthoct            8.909e-01  1.233e-01   7.227 4.95e-13 ***
monthsep            8.471e-01  1.342e-01   6.311 2.78e-10 ***
duration            4.198e-03  7.313e-05  57.406  < 2e-16 ***
campaign           -9.653e-02  1.170e-02  -8.251  < 2e-16 ***
pdays               1.646e-04  3.426e-04   0.480 0.630937    
previous            1.212e-02  7.443e-03   1.628 0.103554    
poutcomeother       2.233e-01  1.017e-01   2.196 0.028124 *  
poutcomesuccess     2.275e+00  9.375e-02  24.270  < 2e-16 ***
poutcomeunknown    -6.327e-02  1.059e-01  -0.598 0.550143    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25413  on 35210  degrees of freedom
Residual deviance: 16708  on 35168  degrees of freedom
AIC: 16794

Number of Fisher Scoring iterations: 6

In [56]:
#Predict on test
test_prediction <- predict(model, test, type="response")

In [57]:
head(test_prediction)


Out[57]:
1
0.0334536795812485
2
0.119641661275679
3
0.0553739182953198
4
0.0427454452591198
5
0.0382007472958081
6
0.00655807206116041

In [58]:
class(test_prediction)


Out[58]:
'numeric'

In [59]:
library(ROCR)


Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

Computing Error Metrics


In [64]:
# True Positive Rate
# False Positive Rate
# Area Under the Curve
# Precision
# Recall

In [66]:
pr <- prediction(test_prediction, test$deposit)

In [67]:
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

In [79]:
plot(prf)



In [69]:
#Area under curve
auc <- performance(pr, measure = "auc")

In [70]:
auc <- auc@y.values[[1]]

In [71]:
auc


Out[71]:
0.902732138881628

In [80]:
#Precision and Recall
precision_recall <- performance(pr, "prec", "rec")
plot(precision_recall)